Skip to content

feat/Broker status - version/image extraction improvements#243

Open
azun wants to merge 6 commits into
masterfrom
feature/reconcile-improvements
Open

feat/Broker status - version/image extraction improvements#243
azun wants to merge 6 commits into
masterfrom
feature/reconcile-improvements

Conversation

@azun
Copy link
Copy Markdown

@azun azun commented Apr 16, 2026

Description

Performance improvements: parallel scrape for broker version and image.

Type of Change

  • Bug Fix
  • New Feature
  • Breaking Change
  • Refactor
  • Documentation
  • Other (please describe)

Checklist

  • I have read the contributing guidelines
  • Existing issues have been referenced (where applicable)
  • I have verified this change is not present in other open pull requests
  • Functionality is documented
  • All code style checks pass
  • New code contribution is covered by automated tests
  • All new and existing tests pass

@azun azun changed the title feat/Broker version extraction improvements feat/Broker status - version/image extraction improvements Apr 16, 2026
@dobrerazvan
Copy link
Copy Markdown

How did you find that slow function (updateStatusWithDockerImageAndVersion)?

@azun azun force-pushed the feature/reconcile-improvements branch from bc10845 to fbb56a9 Compare April 20, 2026 12:51
@azun
Copy link
Copy Markdown
Author

azun commented Apr 20, 2026

How did you find that slow function (updateStatusWithDockerImageAndVersion)?

operator was logging status updates for each broker every 5-15s

Comment thread charts/kafka-operator/templates/podmonitor.yaml Outdated
Comment thread charts/kafka-operator/templates/podmonitor.yaml Outdated
Comment on lines +138 to +139
app: kafka-operator
component: operator
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
app: kafka-operator
component: operator

Comment on lines +120 to +121
app: kafka-operator
component: operator
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
app: kafka-operator
component: operator

r.KafkaCluster.Spec.GetClusterImage(), r.KafkaCluster.Spec.HeadlessServiceEnabled)
if err != nil {
return err
func (r *Reconciler) updateStatusWithDockerImageAndVersion(brokers map[int32]*banzaiv1beta1.BrokerConfig, log logr.Logger) error {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Goroutines have no cancellation path. If the controller shuts down mid-reconcile, these goroutines run to completion (or until JMX times out). Worth passing a context.Context from the reconcile call if JMX extraction can be long-running.

result := <-ch
if result.err != nil {
return result.err
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drain the hole channel first then update the broker status so all or none are updated.

@azun azun force-pushed the feature/reconcile-improvements branch from fbbb905 to a81b695 Compare May 6, 2026 14:03
@azun azun force-pushed the feature/reconcile-improvements branch from a81b695 to 7cac66e Compare May 7, 2026 08:50
@hvan
Copy link
Copy Markdown
Collaborator

hvan commented May 12, 2026

@azun - what do you think of adding condition on when a broker's kafka version needs to be rertrieved? I was thinking that the Kafka version doesn't change often and there are conditions we can implement to have it fetch it. This will reduce the need to retrieve on every reconcile.

Please check out this PR: #249

Overall, the PR looks good and will reduce the time to reconcile. Nice work.

@dobrerazvan
Copy link
Copy Markdown


Code Review — feature/reconcile-improvements

▎ Auto-generated pre-flight review · 5 files · 6 commits

🔴 HIGH — Label removal may break alertmanager-peerauthentication

charts/kafka-operator/templates/operator-deployment-with-webhook.yaml

The removed component: alertmanager label is used as a matchLabels selector in alertmanager-peerauthentication.yaml. After upgrade, the PeerAuthentication
resource will no longer match the operator pod, silently breaking permissive Istio mTLS in clusters with alertManager.permissivePeerAuthentication.create:
true. The app: prometheus removal may additionally break user-managed NetworkPolicy/PrometheusRule selectors.

Suggested fix: Add a NOTES.txt upgrade warning, or verify no active selectors depend on these labels before removing them.


🟡 MEDIUM — PodMonitor bypasses kube-rbac-proxy when both features are enabled

charts/kafka-operator/templates/podmonitor.yaml:29

The PodMonitor targets port: metrics (pod port 8080, plain HTTP). When authProxy.enabled: true (the default), this bypasses the RBAC proxy entirely — any
Prometheus with in-cluster access can scrape operator metrics without authorization. Both flags are opt-in, but the interaction is undocumented and creates a
non-obvious security misconfiguration.

Suggested fix: Add a Helm fail guard when both are enabled, or target port https (8443) with TLS/bearer-token config.


🟡 MEDIUM — PodMonitor missing outer prometheusMetrics.enabled guard

charts/kafka-operator/templates/podmonitor.yaml:1

Every other Prometheus template in this chart wraps its block with {{- if .Values.prometheusMetrics.enabled }}. The new PodMonitor only checks
prometheusMetrics.podMonitor.enabled, so it can render even when metrics are fully disabled — producing a PodMonitor pointing at a non-serving endpoint.

Suggested fix:
{{- if and .Values.prometheusMetrics.enabled .Values.prometheusMetrics.podMonitor.enabled }}


🔵 LOW — http.Client allocated per JMX call defeats connection pooling

pkg/jmxextractor/extractor.go:94

&http.Client{Timeout: 30 * time.Second} is constructed on every call. With the new concurrent goroutine-per-broker model, this spawns N independent clients
per reconcile cycle with no TCP connection reuse.

Suggested fix: Promote the client to a field on jmxExtractor, initialized once in NewJMXExtractor.


✅ Concurrency refactor looks correct

The goroutine-per-broker fan-out in updateStatusWithDockerImageAndVersion is sound: the channel is buffered to len(brokers) and each goroutine sends exactly
once, so there is no goroutine leak on early return. The JMX timeout fix is a clear improvement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants